The original authors from the project collected students’ information
of three different educational levels:
1. School,
2. College,
3. University.
The data collection was conducted by online
and physical surveys. Each one of the surveys form consisted of an
individual’s socio-demographic factors. The data collected was 1205 data
from the time period from December 10th, 2020 to February 5th, 2021 and
the 14 attributes collected were:
1. Gender: Gender type of student
2. Age: Age range of the
student
3. Education Level: Education institution level
4.
Institution Type: Education institution type
5. IT Student: Studying
as IT student or not
6. Location: whether student is located in town
or not
7. Load-shedding: Level of load shedding
8. Financial
Condition: Financial condition of family
9. Internet Type: Internet
type used mostly in device
10. Network Type: Network connectivity
type
11. Class Duration: Daily class duration
12. Self LMS:
Institution’s own LMS availability
13. Device: Device used mostly in
class
14. Adaptability Level: Adaptability level of the student
Predict the outcome of Students’ Adaptability Level Prediction in Online Education using Machine Learning Approaches. Our target column is (14) Adaptability Level: Adaptability level of the student.Our goal include 3 main parts below:
1) Univariate & Multivariate Analysis on features;
2) Data Preprocessing for ML MCA Prediction;
3) ML Model Generation - MCA
1. R setup
2. Knowing my data
3. Univariate Analysis
4. Multivariate Analysis
5. Preparing the data to MCA
6.
Contingency tables
7. Create MCA
At this step we will install or import libraries that are
necessary to this project
pack <- c("plotly",
"tidyverse",
"ggrepel",
"knitr", "kableExtra",
"sjPlot",
"FactoMineR",
"amap",
"ade4",
"readxl",
"viridis")
if(sum(as.numeric(!pack%in%installed.packages()))!= 0){
installing <- pacotes[!pack %in% installed.packages()]
for(i in 1:length(installing)) {
install.packages(installing, dependencies = T)
break()}
sapply(pack, require, character = T)
} else {
sapply(pack, require, character = T)
}
## plotly tidyverse ggrepel knitr kableExtra sjPlot FactoMineR
## TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## amap ade4 readxl viridis
## TRUE TRUE TRUE TRUE
The next step inside our setup is to load the data. The data was
stored at .csv format and in the same folder where this script is
saved.
original_data <- read.csv("students_adaptability_level_online_education.csv")
At this stage I would like to know:
Column names and its
position identification
Type of variables that were registered (as
we will work with AC we need to identify if conversion is needed)
There is any NA and where
What are the unique values
#Column names and its position identification
for (c in 1:14){
print (paste(c,"","is","",colnames(original_data)[c]))
}
## [1] "1 is Gender"
## [1] "2 is Age"
## [1] "3 is Education.Level"
## [1] "4 is Institution.Type"
## [1] "5 is IT.Student"
## [1] "6 is Location"
## [1] "7 is Load.shedding"
## [1] "8 is Financial.Condition"
## [1] "9 is Internet.Type"
## [1] "10 is Network.Type"
## [1] "11 is Class.Duration"
## [1] "12 is Self.Lms"
## [1] "13 is Device"
## [1] "14 is Adaptivity.Level"
# For variable types, easy way
summary(original_data)
## Gender Age Education.Level Institution.Type
## Length:1205 Length:1205 Length:1205 Length:1205
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## IT.Student Location Load.shedding Financial.Condition
## Length:1205 Length:1205 Length:1205 Length:1205
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## Internet.Type Network.Type Class.Duration Self.Lms
## Length:1205 Length:1205 Length:1205 Length:1205
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## Device Adaptivity.Level
## Length:1205 Length:1205
## Class :character Class :character
## Mode :character Mode :character
# There is a NA?
for (c in colnames(original_data)){
for (i in 1:nrow(original_data)){
if (is.na(original_data[i, c])){
print(paste("There is a NA at",i,c))
break
}else{
next
}
}
}
There is no NA on the data set.
#Unique values
for (c in colnames(original_data)){
print(paste("To column ",c))
print(paste(unique(original_data[[c]])))
}
## [1] "To column Gender"
## [1] "Boy" "Girl"
## [1] "To column Age"
## [1] "21-25" "16-20" "11-15" "26-30" "6-10" "1-5"
## [1] "To column Education.Level"
## [1] "University" "College" "School"
## [1] "To column Institution.Type"
## [1] "Non Government" "Government"
## [1] "To column IT.Student"
## [1] "No" "Yes"
## [1] "To column Location"
## [1] "Yes" "No"
## [1] "To column Load.shedding"
## [1] "Low" "High"
## [1] "To column Financial.Condition"
## [1] "Mid" "Poor" "Rich"
## [1] "To column Internet.Type"
## [1] "Wifi" "Mobile Data"
## [1] "To column Network.Type"
## [1] "4G" "3G" "2G"
## [1] "To column Class.Duration"
## [1] "3-6" "1-3" "0"
## [1] "To column Self.Lms"
## [1] "No" "Yes"
## [1] "To column Device"
## [1] "Tab" "Mobile" "Computer"
## [1] "To column Adaptivity.Level"
## [1] "Moderate" "Low" "High"
slices_gender <- c(sum(original_data[1]=='Girl'),sum(original_data[1]=='Boy'))
pcts <- round((slices_gender/sum(slices_gender)*100))
lbls <- c("Girl","Boy")
lbls <- paste(lbls, pcts) # add percents to labels
lbls <- paste(lbls,"%",sep="") # ad % to labels
pie(slices_gender,labels = lbls, main="Gender Distribution\n(Gender type of student)")
slices_inst <- c(sum(original_data[4]=='Non Government'),sum(original_data[4]=='Government'))
pcs <- round((slices_inst/sum(slices_inst)*100))
lbls <- c("Non Government","Government")
lbls <- paste(lbls, pcs) # add percents to labels
lbls <- paste(lbls,"%",sep="") # ad % to labels
pie(slices_inst,labels = lbls, col=rainbow(length(lbls)), main="Education institution\n(Management Type)")
From this Univariate analysis is possible to conclude that:
a)Male accounts for 55% and Female separately accounts for 45%;
b)Regarding the institution type is possible to verify that 68,3% of
the data set were made by students from Non Government managed
institutions.
bar_data <- c(sum(original_data[2] == "21-25"),sum(original_data[2] =="16-20"),sum(original_data[2] == "11-15"),sum(original_data[2] == "26-30"), sum(original_data[2] == "6-10"), sum(original_data[2] == "1-5"))
lbsB <- c("21-25", "16-20", "11-15", "26-30", "6-10","1-5")
barplot(bar_data,names.arg=lbsB,xlab="Age",ylab="Count",col=viridis(length(lbsB)),
main="Age chart",border="white")
bar_data_class <- c(sum(original_data[11] == "3-6"),sum(original_data[11] =="1-3"),sum(original_data[11] == "0"))
lbsC <- c("3-6", "1-3", "0")
barplot(bar_data_class,names.arg=lbsC,xlab="Duration",ylab="Count",col=viridis(length(lbsC)),
main="Class Duration",border="white")
From this Univariate analysis is possible to conclude that:
a) Age of the respondents is mainly distributed between 11 and 25;
b)and the Class Duration mainly distributed between 1-3 hours.
slices_ed_lev <- c(sum(original_data[3]=='College'),sum(original_data[3]=='3G'),sum(original_data[3]=='School'))
pcsEd <- round((slices_ed_lev/sum(slices_ed_lev)*100))
lblsEd <- c("College","University","School")
lblsEd <- paste(lblsEd, pcsEd,"%", sep = " ") # add percents to labels
pie(slices_ed_lev,labels = lblsEd, col=rainbow(length(lblsEd)), main="Education Level")
slices_fin <- c(sum(original_data[8]== "Mid" ),sum(original_data[8]== "Poor" ),sum(original_data[8]== "Rich"))
pcsF <- round((slices_fin/sum(slices_fin)*100))
lblsF <- c("Mid","Poor","Rich")
lblsF <- paste(lblsF, pcsF,"%", sep = " ") # add percents to labels
pie(slices_fin,labels = lblsF, col=rainbow(length(lblsF)), main="Financial Condition")
slices_NT <- c(sum(original_data[10]=='4G'),sum(original_data[10]=='3G'),sum(original_data[10]=='2G'))
pcsNT <- round((slices_NT/sum(slices_NT)*100))
lblsNT <- c("4G","3G","2G")
lblsNT <- paste(lblsNT, pcsNT,"%", sep = " ") # add percents to labels
pie(slices_NT,labels = lblsNT, col=rainbow(length(lblsNT)), main="Network Type")
From this Univariate analysis is possible to conclude that most of
the respondents:
a) Attend School and University
b) Are situated
in Mid financial condition
c)Make use of 4G connection. Also is
important to note that, there are just 1.58% of students using 2G
network.
slices_IT <- c(sum(original_data[5]== "Yes" ),sum(original_data[5]== "No" ))
pcsIT <- round((slices_IT/sum(slices_IT)*100))
lblsIT <- c("Yes","No")
lblsIT <- paste(lblsIT, pcsIT,"%", sep = " ") # add percents to labels
pie(slices_IT,labels = lblsIT, col=rainbow(length(lblsIT)), main= "IT student")
slices_Loc <- c(sum(original_data[6]== "Yes" ),sum(original_data[6]== "No" ))
pcsLoc <- round((slices_Loc/sum(slices_Loc)*100))
lblsLoc <- c("Yes","No")
lblsLoc <- paste(lblsLoc, pcsLoc,"%", sep = " ") # add percents to labels
pie(slices_Loc,labels = lblsLoc, col=rainbow(length(lblsLoc)), main= "Location")
slices_SLm <- c(sum(original_data[12]== "Yes" ),sum(original_data[12]== "No" ))
pcsSLm <- round((slices_SLm/sum(slices_SLm)*100))
lblsSLm <- c("Yes","No")
lblsSLm <- paste(lblsSLm, pcsSLm,"%", sep = " ") # add percents to labels
pie(slices_SLm,labels = lblsSLm, col=rainbow(length(lblsSLm)), main= "Self LMS")
From this Univariate analysis is possible to conclude that:
a) Around 25.2% of respondents are IT student
b) 77.6% of them
are located in town
c)17.4% of their institutions own LMS
availability
slices_L_s <- c(sum(original_data[7]== "High" ),sum(original_data[7]== "Low" ))
pcs_L_s <- round((slices_L_s/sum(slices_L_s)*100))
lbls_L_s <- c("Low","High")
lbls_L_s <- paste(lbls_L_s, pcs_L_s,"%", sep = " ") # add percents to labels
pie(slices_L_s,labels = lbls_L_s, col=rainbow(length(lbls_L_s)), main= "Load-shedding")
slices_Inter_typ <- c(sum(original_data[9]== "Wifi" ),sum(original_data[9]== "Mobile Data" ))
pcsInter_typ <- round((slices_Inter_typ/sum(slices_Inter_typ)*100))
lblsInter_typ <- c("Wifi","Mobile Data")
lblsInter_typ <- paste(lblsInter_typ, pcsInter_typ,"%", sep = " ") # add percents to labels
pie(slices_Inter_typ,labels = lblsInter_typ, col=rainbow(length(lblsInter_typ)), main= "Internet type")
slices_D <- c(sum(original_data[13]== "Tab" ),sum(original_data[13]== "Mobile" ),sum(original_data[13]== "Computer" ))
pcsD <- round((slices_D/sum(slices_D)*100))
lblsD <- c("Tab","Mobile", "Computer")
lblsD <- paste(lblsD, pcsD,"%", sep = " ") # add percents to labels
pie(slices_D,labels = lblsD, col=rainbow(length(lblsD)), main= "Device type")
From this Univariate analysis is possible to conclude that:
a) Load-shedding refers to level of load shedding, and type ‘low’
accounts for 83.3%
b) 57.7% of respondents use mobile data to take
classes
c) 84.1% of the respondents use mobile.
slices_Ad <- c(sum(original_data[14]== "Moderate" ),sum(original_data[14]== "Low" ),sum(original_data[14]== "High" ))
pcsAd <- round((slices_Ad/sum(slices_Ad)*100))
lblsAd <- c("Moderate","Low","High")
lblsAd <- paste(lblsAd, pcsAd,"%", sep = " ") # add percents to labels
pie(slices_Ad,labels = lblsAd, col=rainbow(length(lblsAd)), main= "Adaptivity Level")
Adaptability level refers to adaptability level of the student
during online education. It can be seen that about 52% respond that they
have moderate adaptability, while low accounts for 40%, high responds
for 8% for this sample.
tt <- table(original_data[,1:2]) # Create a contingency table
df_tt <- data.frame(expand.grid(row.names(tt),colnames(tt)),c(tt)) # transofrm into a data frame
colnames(df_tt) <- c("Gender","Age","Count.t")
ggplot(df_tt, aes(fill = Gender, y = Count.t,x = Age))+
geom_bar(position="dodge", stat="identity")+
geom_text(aes(x = Age, y = as.numeric(Count.t), label = format(Count.t, digits = 1)),
size = 2.5,
position = position_dodge(.9),
inherit.aes = TRUE,
na.rm = TRUE, vjust = -1)+
ggtitle("Gender and Age Distribution")
The distribution of Age per gender is:
a) Female’s ditributed
mainly in between 11-15 and 15-20;
b) Male ditributed mainly in
between 11-15 and 21-25.
tt_2 <- table(original_data[,c(1,5)]) # Create a contingency table
df_tt_2 <- data.frame(expand.grid(row.names(tt_2),colnames(tt_2)),c(tt_2))
colnames(df_tt_2) <- c("Gender","IT","Count.t")
ggplot(df_tt_2, aes(fill = Gender, y = Count.t,x = IT))+
geom_bar(position="dodge", stat="identity")+
geom_text(aes(x = IT, y = as.numeric(Count.t), label = format(Count.t, digits = 1)),
size = 2.5,
position = position_dodge(.9),
inherit.aes = TRUE,
na.rm = TRUE, vjust = -1)+
ggtitle("Gender and IT studies")
Is possible to notice that the number of males studying IT
compared to female, for this survey, is higher.
tt_3 <- table(original_data[,c(3,5)]) # Create a contingency table
df_tt_3 <- data.frame(expand.grid(row.names(tt_3),colnames(tt_3)),c(tt_3))
colnames(df_tt_3) <- c("Education.Level","IT","Count.t")
ggplot(df_tt_3, aes(fill = Education.Level, y = Count.t,x = IT))+
geom_bar(position="dodge", stat="identity")+
geom_text(aes(x = IT, y = as.numeric(Count.t), label = format(Count.t, digits = 1)),
size = 2.5,
position = position_dodge(.9),
inherit.aes = TRUE,
na.rm = TRUE, vjust = -1)+
ggtitle("Education Level and IT studies")
The got the most IT students, whereas 30 and 27 IT students in
college and school.
tt_4 <- table(original_data[,c(1,14)]) # Create a contingency table
df_tt_4 <- data.frame(expand.grid(row.names(tt_4),colnames(tt_4)),c(tt_4))
colnames(df_tt_4) <- c("Gender","Adaptability","Count.t")
ggplot(df_tt_4, aes(fill = Gender, y = Count.t,x = Adaptability))+
geom_bar(position="dodge", stat="identity")+
geom_text(aes(x = Adaptability, y = as.numeric(Count.t), label = format(Count.t, digits = 1)),
size = 2.5,
position = position_dodge(.9),
inherit.aes = TRUE,
na.rm = TRUE, vjust = -1)+
ggtitle("Gender and Adaptivity Level")
For this survey, 71 male respondents answered as they have high
adaptivity level in online education compared to 29 female.
Additionally, both male and female overall got moderate adaptivity
level.
tt_5 <- table(original_data[,c(2,14)]) # Create a contingency table
df_tt_5 <- data.frame(expand.grid(row.names(tt_5),colnames(tt_5)),c(tt_5))
colnames(df_tt_5) <- c("Age","Adaptability","Count.t")
ggplot(df_tt_5, aes(fill = Age, y = Count.t,x = Adaptability))+
geom_bar(position="dodge", stat="identity")+
geom_text(aes(x = Adaptability, y = as.numeric(Count.t), label = format(Count.t, digits = 1)),
size = 2.5,
position = position_dodge(.9),
inherit.aes = TRUE,
na.rm = TRUE, vjust = -1)+
ggtitle("Age Distribution by Adaptivity Level")
a) 11-15, 21-25 and 26-30 high adaptivity levelif compared with the
other ages. None of 1-5 got high adataptivity.
fem <- original_data%>%
filter(Gender == "Girl")
tt_6 <- table(fem[,c(2,14)]) # Create a contingency table
tt_6
## Adaptivity.Level
## Age High Low Moderate
## 1-5 0 3 64
## 11-15 5 46 91
## 16-20 5 104 60
## 21-25 6 45 58
## 26-30 0 26 0
## 6-10 13 11 5
df_tt_6 <- data.frame(expand.grid(row.names(tt_6),colnames(tt_6)),c(tt_6))
colnames(df_tt_6) <- c("Age","Adaptability","Count.t")
ggplot(df_tt_6, aes(fill = Adaptability, y = Count.t,x = Age))+
geom_bar(position="dodge", stat="identity")+
geom_text(aes(x = Age, y = as.numeric(Count.t), label = format(Count.t, digits = 1)),
size = 2.5,
position = position_dodge(.9),
inherit.aes = TRUE,
na.rm = TRUE, vjust = -1)+
ggtitle("Age Distribution by Adaptivity Level for Girls")
From this chart is possible observe that, for the Female gender
respondents for this survey:
a) Between the ages 26-30 all
respondents answer as having Low adaptivity
b) Bigger number of
responses related to the Moderate adaptativity were observed at the ages
between 1-5,11-5 and 21-25
c) A slightly higher adaptivity was
noticed at the age range of 6-10, being just to answers higher is not
possible afirm anything.
male <- original_data%>%
filter(Gender == "Boy")
tt_7 <- table(male[,c(2,14)]) # Create a contingency table
tt_7
## Adaptivity.Level
## Age High Low Moderate
## 1-5 0 14 0
## 11-15 23 74 114
## 16-20 0 40 69
## 21-25 32 94 139
## 26-30 12 10 20
## 6-10 4 13 5
df_tt_7 <- data.frame(expand.grid(row.names(tt_7),colnames(tt_7)),c(tt_7))
colnames(df_tt_7) <- c("Age","Adaptability","Count.t")
ggplot(df_tt_7, aes(fill = Adaptability, y = Count.t,x = Age))+
geom_bar(position="dodge", stat="identity")+
geom_text(aes(x = Age, y = as.numeric(Count.t), label = format(Count.t, digits = 1)),
size = 2.5,
position = position_dodge(.9),
inherit.aes = TRUE,
na.rm = TRUE, vjust = -1)+
ggtitle("Age Distribution by Adaptivity Level for Boys")
From this chart is possible observe that, for the Male gender
respondents for this survey:
a) Between the ages 26-30 majority of
the respondents answered as having Moderate adaptivity
b) Bigger
number of responses related to the Moderate adaptativity were observed
at the ages range between 11-15, 16-20,21-25 and 26-30
c) For the
age range of 1-5 all respondents answered Low
d) And for 6-10
majority of responses were low.
To input the data at MCA formula is needed to convert the data for factor format when applying dud.acm().
# Data convert to factor
original_data_factor <- as.data.frame(unclass(original_data), stringsAsFactors = TRUE)
The next step here is to extract the observed frequencies, we do it by summary().
summary(original_data_factor)
## Gender Age Education.Level Institution.Type IT.Student
## Boy :663 1-5 : 81 College :219 Government :382 No :901
## Girl:542 11-15:353 School :530 Non Government:823 Yes:304
## 16-20:278 University:456
## 21-25:374
## 26-30: 68
## 6-10 : 51
## Location Load.shedding Financial.Condition Internet.Type Network.Type
## No :270 High: 201 Mid :878 Mobile Data:695 2G: 19
## Yes:935 Low :1004 Poor:242 Wifi :510 3G:411
## Rich: 85 4G:775
##
##
##
## Class.Duration Self.Lms Device Adaptivity.Level
## 0 :154 No :995 Computer: 162 High :100
## 1-3:840 Yes:210 Mobile :1013 Low :480
## 3-6:211 Tab : 30 Moderate:625
##
##
##
We want to test if exist correlation between Adaptivity.Level
(referencial attribute) and the other attributes. So Adaptivity.Level
will be our row, and we need compare with each one of the others
attributes. As output we will have 4 tables where:
1.Numbers
displayed in blue are row percentage for
observed freq.
2.Numbers displayed in
green are col. percentage for observed
freq.
3.Numbers displayed in black are
observed freq.
4.Numbers displayed in dark
green are explained freq.
The hypothesis to be tested are:
H0: The association between the two categorical variables
is random.
H1: The association between the two categorical variables is
not random.
The Variables to have its association tested are: Gender;
Age; Education.Level; Institution.Type; IT.Student; Location;
Load.shedding; Financial.Condition; Internet.Type; Network.Type;
Class.Duration; Self.Lms; Device;
sjt.xtab(var.row = original_data_factor$Adaptivity.Level,
var.col = original_data_factor$Device,
show.exp = TRUE,
show.row.prc = TRUE,
show.col.prc = TRUE,
encoding = "UTF-8")
| Adaptivity.Level | Device | Total | ||
|---|---|---|---|---|
| Computer | Mobile | Tab | ||
| High |
30 13 30Â % 18.5Â % |
68 84 68Â % 6.7Â % |
2 2 2Â % 6.7Â % |
100 100 100Â % 8.3Â % |
| Low |
40 65 8.3Â % 24.7Â % |
438 404 91.2Â % 43.2Â % |
2 12 0.4Â % 6.7Â % |
480 480 100Â % 39.8Â % |
| Moderate |
92 84 14.7Â % 56.8Â % |
507 525 81.1Â % 50Â % |
26 16 4.2Â % 86.7Â % |
625 625 100Â % 51.9Â % |
| Total |
162 162 13.4Â % 100Â % |
1013 1013 84.1Â % 100Â % |
30 30 2.5Â % 100Â % |
1205 1205 100Â % 100Â % |
χ2=52.519 · df=4 · Cramer’s V=0.148 · Fisher’s p=0.000 |
sjt.xtab(var.row = original_data_factor$Adaptivity.Level,
var.col = original_data_factor$Self.Lms,
show.exp = TRUE,
show.row.prc = TRUE,
show.col.prc = TRUE,
encoding = "UTF-8")
| Adaptivity.Level | Self.Lms | Total | |
|---|---|---|---|
| No | Yes | ||
| High |
70 83 70Â % 7Â % |
30 17 30Â % 14.3Â % |
100 100 100Â % 8.3Â % |
| Low |
428 396 89.2Â % 43Â % |
52 84 10.8Â % 24.8Â % |
480 480 100Â % 39.8Â % |
| Moderate |
497 516 79.5Â % 49.9Â % |
128 109 20.5Â % 61Â % |
625 625 100Â % 51.9Â % |
| Total |
995 995 82.6Â % 100Â % |
210 210 17.4Â % 100Â % |
1205 1205 100Â % 100Â % |
χ2=29.535 · df=2 · Cramer’s V=0.157 · p=0.000 |
sjt.xtab(var.row = original_data_factor$Adaptivity.Level,
var.col = original_data_factor$Class.Duration,
show.exp = TRUE,
show.row.prc = TRUE,
show.col.prc = TRUE,
encoding = "UTF-8")
| Adaptivity.Level | Class.Duration | Total | ||
|---|---|---|---|---|
| 0 | 1-3 | 3-6 | ||
| High |
0 13 0Â % 0Â % |
82 70 82Â % 9.8Â % |
18 18 18Â % 8.5Â % |
100 100 100Â % 8.3Â % |
| Low |
144 61 30Â % 93.5Â % |
290 335 60.4Â % 34.5Â % |
46 84 9.6Â % 21.8Â % |
480 480 100Â % 39.8Â % |
| Moderate |
10 80 1.6Â % 6.5Â % |
468 436 74.9Â % 55.7Â % |
147 109 23.5Â % 69.7Â % |
625 625 100Â % 51.9Â % |
| Total |
154 154 12.8Â % 100Â % |
840 840 69.7Â % 100Â % |
211 211 17.5Â % 100Â % |
1205 1205 100Â % 100Â % |
χ2=225.918 · df=4 · Cramer’s V=0.306 · p=0.000 |
sjt.xtab(var.row = original_data_factor$Adaptivity.Level,
var.col = original_data_factor$Network.Type,
show.exp = TRUE,
show.row.prc = TRUE,
show.col.prc = TRUE,
encoding = "UTF-8")
| Adaptivity.Level | Network.Type | Total | ||
|---|---|---|---|---|
| 2G | 3G | 4G | ||
| High |
0 2 0Â % 0Â % |
22 34 22Â % 5.4Â % |
78 64 78Â % 10.1Â % |
100 100 100Â % 8.3Â % |
| Low |
16 8 3.3Â % 84.2Â % |
186 164 38.8Â % 45.3Â % |
278 309 57.9Â % 35.9Â % |
480 480 100Â % 39.8Â % |
| Moderate |
3 10 0.5Â % 15.8Â % |
203 213 32.5Â % 49.4Â % |
419 402 67Â % 54.1Â % |
625 625 100Â % 51.9Â % |
| Total |
19 19 1.6Â % 100Â % |
411 411 34.1Â % 100Â % |
775 775 64.3Â % 100Â % |
1205 1205 100Â % 100Â % |
χ2=30.243 · df=4 · Cramer’s V=0.112 · Fisher’s p=0.000 |
sjt.xtab(var.row = original_data_factor$Adaptivity.Level,
var.col = original_data_factor$Load.shedding,
show.exp = TRUE,
show.row.prc = TRUE,
show.col.prc = TRUE,
encoding = "UTF-8")
| Adaptivity.Level | Load.shedding | Total | |
|---|---|---|---|
| High | Low | ||
| High |
13 17 13Â % 6.5Â % |
87 83 87Â % 8.7Â % |
100 100 100Â % 8.3Â % |
| Low |
100 80 20.8Â % 49.8Â % |
380 400 79.2Â % 37.8Â % |
480 480 100Â % 39.8Â % |
| Moderate |
88 104 14.1Â % 43.8Â % |
537 521 85.9Â % 53.5Â % |
625 625 100Â % 51.9Â % |
| Total |
201 201 16.7Â % 100Â % |
1004 1004 83.3Â % 100Â % |
1205 1205 100Â % 100Â % |
χ2=9.972 · df=2 · Cramer’s V=0.091 · p=0.007 |
sjt.xtab(var.row = original_data_factor$Adaptivity.Level,
var.col = original_data_factor$Location,
show.exp = TRUE,
show.row.prc = TRUE,
show.col.prc = TRUE,
encoding = "UTF-8")
| Adaptivity.Level | Location | Total | |
|---|---|---|---|
| No | Yes | ||
| High |
8 22 8Â % 3Â % |
92 78 92Â % 9.8Â % |
100 100 100Â % 8.3Â % |
| Low |
171 108 35.6Â % 63.3Â % |
309 372 64.4Â % 33Â % |
480 480 100Â % 39.8Â % |
| Moderate |
91 140 14.6Â % 33.7Â % |
534 485 85.4Â % 57.1Â % |
625 625 100Â % 51.9Â % |
| Total |
270 270 22.4Â % 100Â % |
935 935 77.6Â % 100Â % |
1205 1205 100Â % 100Â % |
χ2=82.310 · df=2 · Cramer’s V=0.261 · p=0.000 |
sjt.xtab(var.row = original_data_factor$Adaptivity.Level,
var.col = original_data_factor$IT.Student,
show.exp = TRUE,
show.row.prc = TRUE,
show.col.prc = TRUE,
encoding = "UTF-8")
| Adaptivity.Level | IT.Student | Total | |
|---|---|---|---|
| No | Yes | ||
| High |
67 75 67Â % 7.4Â % |
33 25 33Â % 10.9Â % |
100 100 100Â % 8.3Â % |
| Low |
391 359 81.5Â % 43.4Â % |
89 121 18.5Â % 29.3Â % |
480 480 100Â % 39.8Â % |
| Moderate |
443 467 70.9Â % 49.2Â % |
182 158 29.1Â % 59.9Â % |
625 625 100Â % 51.9Â % |
| Total |
901 901 74.8Â % 100Â % |
304 304 25.2Â % 100Â % |
1205 1205 100Â % 100Â % |
χ2=19.597 · df=2 · Cramer’s V=0.128 · p=0.000 |
sjt.xtab(var.row = original_data_factor$Adaptivity.Level,
var.col = original_data_factor$Age,
show.exp = TRUE,
show.row.prc = TRUE,
show.col.prc = TRUE,
encoding = "UTF-8")
| Adaptivity.Level | Age | Total | |||||
|---|---|---|---|---|---|---|---|
| 1-5 | 11-15 | 16-20 | 21-25 | 26-30 | 6-10 | ||
| High |
0 7 0Â % 0Â % |
28 29 28Â % 7.9Â % |
5 23 5Â % 1.8Â % |
38 31 38Â % 10.2Â % |
12 6 12Â % 17.6Â % |
17 4 17Â % 33.3Â % |
100 100 100Â % 8.3Â % |
| Low |
17 32 3.5Â % 21Â % |
120 141 25Â % 34Â % |
144 111 30Â % 51.8Â % |
139 149 29Â % 37.2Â % |
36 27 7.5Â % 52.9Â % |
24 20 5Â % 47.1Â % |
480 480 100Â % 39.8Â % |
| Moderate |
64 42 10.2Â % 79Â % |
205 183 32.8Â % 58.1Â % |
129 144 20.6Â % 46.4Â % |
197 194 31.5Â % 52.7Â % |
20 35 3.2Â % 29.4Â % |
10 26 1.6Â % 19.6Â % |
625 625 100Â % 51.9Â % |
| Total |
81 81 6.7Â % 100Â % |
353 353 29.3Â % 100Â % |
278 278 23.1Â % 100Â % |
374 374 31Â % 100Â % |
68 68 5.6Â % 100Â % |
51 51 4.2Â % 100Â % |
1205 1205 100Â % 100Â % |
χ2=125.296 · df=10 · Cramer’s V=0.228 · Fisher’s p=0.000 |
sjt.xtab(var.row = original_data_factor$Adaptivity.Level,
var.col = original_data_factor$Gender,
show.exp = TRUE,
show.row.prc = TRUE,
show.col.prc = TRUE,
encoding = "UTF-8")
| Adaptivity.Level | Gender | Total | |
|---|---|---|---|
| Boy | Girl | ||
| High |
71 55 71Â % 10.7Â % |
29 45 29Â % 5.4Â % |
100 100 100Â % 8.3Â % |
| Low |
245 264 51Â % 37Â % |
235 216 49Â % 43.4Â % |
480 480 100Â % 39.8Â % |
| Moderate |
347 344 55.5Â % 52.3Â % |
278 281 44.5Â % 51.3Â % |
625 625 100Â % 51.9Â % |
| Total |
663 663 55Â % 100Â % |
542 542 45Â % 100Â % |
1205 1205 100Â % 100Â % |
χ2=13.451 · df=2 · Cramer’s V=0.106 · p=0.001 |
sjt.xtab(var.row = original_data_factor$Adaptivity.Level,
var.col = original_data_factor$Education.Level,
show.exp = TRUE,
show.row.prc = TRUE,
show.col.prc = TRUE,
encoding = "UTF-8")
| Adaptivity.Level | Education.Level | Total | ||
|---|---|---|---|---|
| College | School | University | ||
| High |
3 18 3Â % 1.4Â % |
47 44 47Â % 8.9Â % |
50 38 50Â % 11Â % |
100 100 100Â % 8.3Â % |
| Low |
120 87 25Â % 54.8Â % |
182 211 37.9Â % 34.3Â % |
178 182 37.1Â % 39Â % |
480 480 100Â % 39.8Â % |
| Moderate |
96 114 15.4Â % 43.8Â % |
301 275 48.2Â % 56.8Â % |
228 237 36.5Â % 50Â % |
625 625 100Â % 51.9Â % |
| Total |
219 219 18.2Â % 100Â % |
530 530 44Â % 100Â % |
456 456 37.8Â % 100Â % |
1205 1205 100Â % 100Â % |
χ2=38.686 · df=4 · Cramer’s V=0.127 · p=0.000 |
sjt.xtab(var.row = original_data_factor$Adaptivity.Level,
var.col = original_data_factor$Institution.Type,
show.exp = TRUE,
show.row.prc = TRUE,
show.col.prc = TRUE,
encoding = "UTF-8")
| Adaptivity.Level | Institution.Type | Total | |
|---|---|---|---|
| Government | Non Government | ||
| High |
20 32 20Â % 5.2Â % |
80 68 80Â % 9.7Â % |
100 100 100Â % 8.3Â % |
| Low |
234 152 48.8Â % 61.3Â % |
246 328 51.2Â % 29.9Â % |
480 480 100Â % 39.8Â % |
| Moderate |
128 198 20.5Â % 33.5Â % |
497 427 79.5Â % 60.4Â % |
625 625 100Â % 51.9Â % |
| Total |
382 382 31.7Â % 100Â % |
823 823 68.3Â % 100Â % |
1205 1205 100Â % 100Â % |
χ2=107.108 · df=2 · Cramer’s V=0.298 · p=0.000 |
sjt.xtab(var.row = original_data_factor$Adaptivity.Level,
var.col = original_data_factor$Internet.Type,
show.exp = TRUE,
show.row.prc = TRUE,
show.col.prc = TRUE,
encoding = "UTF-8")
| Adaptivity.Level | Internet.Type | Total | |
|---|---|---|---|
| Mobile Data | Wifi | ||
| High |
36 58 36Â % 5.2Â % |
64 42 64Â % 12.5Â % |
100 100 100Â % 8.3Â % |
| Low |
288 277 60Â % 41.4Â % |
192 203 40Â % 37.6Â % |
480 480 100Â % 39.8Â % |
| Moderate |
371 360 59.4Â % 53.4Â % |
254 265 40.6Â % 49.8Â % |
625 625 100Â % 51.9Â % |
| Total |
695 695 57.7Â % 100Â % |
510 510 42.3Â % 100Â % |
1205 1205 100Â % 100Â % |
χ2=21.036 · df=2 · Cramer’s V=0.132 · p=0.000 |
sjt.xtab(var.row = original_data_factor$Adaptivity.Level,
var.col = original_data_factor$Financial.Condition,
show.exp = TRUE,
show.row.prc = TRUE,
show.col.prc = TRUE,
encoding = "UTF-8")
| Adaptivity.Level | Financial.Condition | Total | ||
|---|---|---|---|---|
| Mid | Poor | Rich | ||
| High |
36 73 36Â % 4.1Â % |
22 20 22Â % 9.1Â % |
42 7 42Â % 49.4Â % |
100 100 100Â % 8.3Â % |
| Low |
341 350 71Â % 38.8Â % |
129 96 26.9Â % 53.3Â % |
10 34 2.1Â % 11.8Â % |
480 480 100Â % 39.8Â % |
| Moderate |
501 455 80.2Â % 57.1Â % |
91 126 14.6Â % 37.6Â % |
33 44 5.3Â % 38.8Â % |
625 625 100Â % 51.9Â % |
| Total |
878 878 72.9Â % 100Â % |
242 242 20.1Â % 100Â % |
85 85 7.1Â % 100Â % |
1205 1205 100Â % 100Â % |
χ2=236.865 · df=4 · Cramer’s V=0.314 · p=0.000 |
Looking at all p-values, is possible to reject H0, so is possible to say that there is a not random association between all the categorical variables and the Adaptivity.Level.
Thus for the MCA we will keep all the attributes
MCA <- dudi.acm(original_data_factor, scannf = FALSE, nf = 3)
##nf = 3 means that the coordinates extracted are referent to 3 dimensions (the 3 that have bigger part. on the variance), I choose it because I want to plot a 3D graph.
var_perc <- MCA$eig/sum(MCA$eig)*100
paste0(round(var_perc,2),"%")
## [1] "14.6%" "10.91%" "8.48%" "7.45%" "5.78%" "5.59%" "5.11%" "4.24%"
## [9] "4.2%" "3.88%" "3.69%" "3.26%" "3.15%" "2.77%" "2.48%" "2.34%"
## [17] "2.27%" "2.05%" "1.97%" "1.81%" "1.45%" "1.36%" "0.96%" "0.19%"
Were generated 24 dimensions because the maximal number of dimensions
is given by number of attributes(J = 38) subtracted by the number of
variables(Q = 14).
##### Number of attributes per variable
numb_attr <- apply(original_data_factor,
MARGIN = 2,
FUN = function(x) nlevels(as.factor(x)))
#R uses the number of attributes to generate an data frame with the coordinates of the Binary or Burt Matrix
Binary Matriz Method
df_MCA <- data.frame(MCA$c1, Variable = rep(names(numb_attr), numb_attr))
df_MCA %>%
rownames_to_column() %>%
rename(Attribute = 1)%>%
ggplot(aes(x = CS1, y = CS2, label = Attribute, color = Variable))+
geom_point()+
geom_label_repel()+
geom_vline(aes(xintercept = 0), linetype = "longdash", color = "grey48") +
geom_hline(aes(yintercept = 0), linetype = "longdash", color = "grey48") +
labs(x = paste("Dimension 1:", paste0(round(var_perc[1], 2), "%")),
y = paste("Dimension 2:", paste0(round(var_perc[2], 2), "%"))) +
theme_bw()
## Warning: ggrepel: 12 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
MCA_3D <- plot_ly()
MCA_3D <- add_trace(p = MCA_3D,
x = df_MCA$CS1,
y = df_MCA$CS2,
z = df_MCA$CS3,
mode = "text",
text = rownames(df_MCA),
textfont = list(color = "blue"),
marker = list(color = "red"),
showlegend = FALSE)
MCA_3D
## No trace type specified:
## Based on info supplied, a 'scatter3d' trace seems appropriate.
## Read more about this trace type -> https://plotly.com/r/reference/#scatter3d
## A marker object has been specified, but markers is not in the mode
## Adding markers to the mode...
Exists an association between:
a) Adaptivity.Level Low and
Institution.Type.Government, Network.Type 2G, Age 26-30, Load.shedding
High, Device Mobile, Location.No (out of Town), Gender.Girl.
b)Adaptivity.Level Low and Institution.Type.Non Government,
Network.Type 4G, Class.Duration 1.3h, Load.shedding Low,
Location.Yes(Town), Gender.Boy, Financial.Condition Mid.
c)Adaptivity.Level High, Financial.Condition Rich.